Extra 3.2 - Historical Provenance - Application 3: RRG Chat Messages

Identifying instructions from chat messages in the Radiation Response Game.

In this notebook, we explore how well classification performs when it uses the provenance of a data entity instead of its dependencies (as used in the original experiments and in the paper). To distinguish between the two, we call the former historical provenance and the latter forward provenance. Apart from using historical provenance, all steps are the same as in the original experiments.

Goal: To determine whether the provenance network analytics method can identify instructions from the provenance of a chat message.
Classification labels: $\mathcal{L} = \left\{ \textit{instruction}, \textit{other} \right\} $.
Training data: 69 chat messages manually categorised by HCI researchers.

Reading data

The RRG dataset based on historical provenance is provided in the rrg/ancestor-graphs.csv file, which contains a table whose rows correspond to individual chat messages in RRG:

First column: the identifier of the chat message
label: the manual classification of the message (e.g., instruction, information, requests, etc.)
The remaining columns provide the provenance network metrics calculated from the historical provenance graph of the message.

Note that in this extra experiment, we use the full (historical) provenance of a message, not limiting how far it goes. Hence, there is no $k$ parameter in this experiment.



In [1]:

    
import pandas as pd



In [2]:

    
filepath = "rrg/ancestor-graphs.csv"



In [3]:

    
df = pd.read_csv(filepath, index_col=0)
df.head()









    Out[3]:







  
    
      
	label	entities	agents	activities	nodes	edges	diameter	assortativity	acc	acc_e	...	mfd_e_a	mfd_e_ag	mfd_a_e	mfd_a_a	mfd_a_ag	mfd_ag_e	mfd_ag_a	mfd_ag_ag	mfd_der	powerlaw_alpha
21	requests	186	7	21	214	469	7	0.012152	0.488348	0.445533	...	22	19	34	22	19	0	0	0	37	2.924960
20	commissives	183	7	20	210	461	7	0.007546	0.487386	0.446461	...	22	19	33	22	19	0	0	0	37	2.858642
23	assertives	216	7	23	246	543	7	-0.001550	0.489050	0.447828	...	26	22	38	26	19	0	0	0	46	2.867888
25	instruction	220	7	24	251	553	7	0.002591	0.489752	0.447110	...	26	22	38	26	19	0	0	0	46	2.891161
24	instruction	219	7	24	250	551	7	0.002284	0.489859	0.447021	...	26	22	38	26	19	0	0	0	46	2.928098
    
  

5 rows × 23 columns
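As a quick sanity check on the loaded metrics, the sketch below tests whether `nodes` equals `entities + agents + activities` for the five previewed rows. That relationship is an assumption suggested by the column names, not something the notebook states, and the helper `node_count_mismatches` is a hypothetical name introduced here for illustration. The values are copied from the `df.head()` output above.

```python
# Values copied from the df.head() preview:
# (message id, entities, agents, activities, nodes)
preview_rows = [
    (21, 186, 7, 21, 214),
    (20, 183, 7, 20, 210),
    (23, 216, 7, 23, 246),
    (25, 220, 7, 24, 251),
    (24, 219, 7, 24, 250),
]


def node_count_mismatches(rows):
    """Return the ids of rows where entities + agents + activities != nodes.

    ASSUMPTION: `nodes` is the total of the three PROV node types.
    """
    return [
        msg_id
        for msg_id, entities, agents, activities, nodes in rows
        if entities + agents + activities != nodes
    ]


mismatches = node_count_mismatches(preview_rows)
print(mismatches)  # an empty list means every previewed row is consistent
```

The same check could be run over the whole table by comparing the `nodes` column with the sum of the other three columns in `df`.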

Labelling data

Since we are only interested in the instruction messages, we categorise the data entities into two sets: instruction and other.

Note: This section is just an example showing the data transformation to be applied to each dataset.



In [4]:

    
def label(value):
    return value if value == 'instruction' else 'other'



In [5]:

    
df.label = df.label.apply(label).astype('category')
df.head()









    Out[5]:







  
    
      
	label	entities	agents	activities	nodes	edges	diameter	assortativity	acc	acc_e	...	mfd_e_a	mfd_e_ag	mfd_a_e	mfd_a_a	mfd_a_ag	mfd_ag_e	mfd_ag_a	mfd_ag_ag	mfd_der	powerlaw_alpha
21	other	186	7	21	214	469	7	0.012152	0.488348	0.445533	...	22	19	34	22	19	0	0	0	37	2.924960
20	other	183	7	20	210	461	7	0.007546	0.487386	0.446461	...	22	19	33	22	19	0	0	0	37	2.858642
23	other	216	7	23	246	543	7	-0.001550	0.489050	0.447828	...	26	22	38	26	19	0	0	0	46	2.867888
25	instruction	220	7	24	251	553	7	0.002591	0.489752	0.447110	...	26	22	38	26	19	0	0	0	46	2.891161
24	instruction	219	7	24	250	551	7	0.002284	0.489859	0.447021	...	26	22	38	26	19	0	0	0	46	2.928098
    
  

5 rows × 23 columns

Balancing data

This section explores the balance of the RRG dataset.



In [6]:

    
# Examine the balance of the dataset
df.label.value_counts()









    Out[6]:





other          37
instruction    32
Name: label, dtype: int64

Since both labels have roughly the same number of data points (37 vs. 32), we decide not to balance the RRG dataset.
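To put the class counts in context, the following sketch computes the accuracy of a trivial classifier that always predicts the most frequent label. This baseline is not part of the original experiment; it is added here only as a reference point for reading the accuracies reported later. The counts are taken from the `value_counts()` output above.

```python
# Class counts as reported by df.label.value_counts()
label_counts = {"other": 37, "instruction": 32}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the majority label
baseline_accuracy = label_counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.2%} of {total} messages")
# prints "other: 53.62% of 69 messages"
```

A classifier therefore has to exceed roughly 54% accuracy on this dataset to do better than always guessing "other".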

Cross validation

We now run the cross validation tests on the datasets using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for the detailed description of the cross validation code.
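The three feature sets can be thought of as column subsets of the table. The sketch below illustrates the idea using only the column names visible in the `df.head()` preview. The actual grouping is defined in the `analytics` module and is not shown in this notebook, so the `assumed_generic` set and the `feature_sets` helper are hypothetical and may not match the authors' split.

```python
# Column names visible in the df.head() preview (elided columns are left out)
visible_columns = [
    "label", "entities", "agents", "activities", "nodes", "edges",
    "diameter", "assortativity", "acc", "acc_e",
    "mfd_e_a", "mfd_e_ag", "mfd_a_e", "mfd_a_a", "mfd_a_ag",
    "mfd_ag_e", "mfd_ag_a", "mfd_ag_ag", "mfd_der", "powerlaw_alpha",
]

# ASSUMPTION: metrics that apply to any network, regardless of PROV types.
# This grouping is a guess for illustration, not the authors' definition.
assumed_generic = {
    "nodes", "edges", "diameter", "assortativity", "acc", "powerlaw_alpha",
}


def feature_sets(columns, generic):
    """Split the feature columns into combined, generic and provenance sets."""
    features = [c for c in columns if c != "label"]  # the label is the target
    return {
        "combined": features,
        "generic": [c for c in features if c in generic],
        "provenance": [c for c in features if c not in generic],
    }


sets = feature_sets(visible_columns, assumed_generic)
for name, cols in sets.items():
    print(name, len(cols))
```

Under this assumed grouping, each experiment would train on `df[sets[name]]` with `df.label` as the target.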



In [7]:

    
from analytics import test_classification



In [8]:

    
results, importances = test_classification(df, n_iterations=1000)









    



Accuracy: 64.07% ±1.1212 <-- combined
Accuracy: 66.20% ±1.1259 <-- generic
Accuracy: 61.03% ±1.1090 <-- provenance

Results: Compared to the top accuracy of 85% achieved using forward provenance, historical provenance yields a much lower top accuracy of 66% in this application. This supports our hypothesis that the forward provenance of a data entity correlates better with its nature or characteristics than its historical provenance does, since forward provenance records how the data entity was used.
